{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " # Manipulación, Limpieza y Exploración de Datos\n",
    "\n",
    "Este cuaderno tiene como objetivo revisar diversas técnicas de manipulación y limpieza de datos. Posterior al procesamiento de datos, realizaremos un análisis exploratorio de los datos para obtener insights valiosos. Es importante tener en cuenta que usaremos el dataset `bank_marketing` del repositorio de Machine Learning de UCI. Este conjunto de datos nos proporcionará una gran oportunidad para aplicar y practicar las técnicas que aprenderemos.\n",
    "\n",
    "## Contenidos\n",
    "\n",
    "A lo largo de este cuaderno, abordaremos los siguientes temas:\n",
    "\n",
    "- Selección y filtrado de datos: Aprenderemos cómo seleccionar y filtrar datos de manera eficiente en Python.\n",
    "- Manejo de valores nulos: Trataremos con valores nulos y aprenderemos técnicas para manejarlos.\n",
    "- Transformación de columnas: Veremos cómo transformar y manipular columnas en un DataFrame.\n",
    "- Estadísticas descriptivas: Obtendremos estadísticas descriptivas de nuestros datos para entender mejor su distribución y tendencias.\n",
    "- Correlaciones: Exploraremos las correlaciones entre diferentes variables en nuestros datos.\n",
    "\n",
    "## Ejemplos prácticos\n",
    "\n",
    "Aplicaremos estas técnicas en ejemplos prácticos para mejorar la calidad de los datos y para identificar tendencias y patrones en datos de ventas y marketing.\n",
    "\n",
    "## Dataset\n",
    "\n",
    "Para este cuaderno, utilizaremos el dataset `bank_marketing` del repositorio de Machine Learning de UCI. Este conjunto de datos nos proporcionará una gran oportunidad para aplicar y practicar las técnicas que aprenderemos."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>job</th>\n",
       "      <th>marital</th>\n",
       "      <th>education</th>\n",
       "      <th>default</th>\n",
       "      <th>balance</th>\n",
       "      <th>housing</th>\n",
       "      <th>loan</th>\n",
       "      <th>contact</th>\n",
       "      <th>day_of_month</th>\n",
       "      <th>month</th>\n",
       "      <th>duration</th>\n",
       "      <th>campaign</th>\n",
       "      <th>pdays</th>\n",
       "      <th>previous</th>\n",
       "      <th>poutcome</th>\n",
       "      <th>HasTermDeposit</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>58</td>\n",
       "      <td>management</td>\n",
       "      <td>married</td>\n",
       "      <td>tertiary</td>\n",
       "      <td>no</td>\n",
       "      <td>2143</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "      <td>may</td>\n",
       "      <td>261</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>44</td>\n",
       "      <td>technician</td>\n",
       "      <td>single</td>\n",
       "      <td>secondary</td>\n",
       "      <td>no</td>\n",
       "      <td>29</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "      <td>may</td>\n",
       "      <td>151</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>33</td>\n",
       "      <td>entrepreneur</td>\n",
       "      <td>married</td>\n",
       "      <td>secondary</td>\n",
       "      <td>no</td>\n",
       "      <td>2</td>\n",
       "      <td>yes</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "      <td>may</td>\n",
       "      <td>76</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>47</td>\n",
       "      <td>blue-collar</td>\n",
       "      <td>married</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>1506</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "      <td>may</td>\n",
       "      <td>92</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>33</td>\n",
       "      <td>NaN</td>\n",
       "      <td>single</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "      <td>1</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "      <td>may</td>\n",
       "      <td>198</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age           job  marital  education default  balance housing loan  \\\n",
       "0   58    management  married   tertiary      no     2143     yes   no   \n",
       "1   44    technician   single  secondary      no       29     yes   no   \n",
       "2   33  entrepreneur  married  secondary      no        2     yes  yes   \n",
       "3   47   blue-collar  married        NaN      no     1506     yes   no   \n",
       "4   33           NaN   single        NaN      no        1      no   no   \n",
       "\n",
       "  contact  day_of_month month  duration  campaign  pdays  previous poutcome  \\\n",
       "0     NaN             5   may       261         1     -1         0      NaN   \n",
       "1     NaN             5   may       151         1     -1         0      NaN   \n",
       "2     NaN             5   may        76         1     -1         0      NaN   \n",
       "3     NaN             5   may        92         1     -1         0      NaN   \n",
       "4     NaN             5   may       198         1     -1         0      NaN   \n",
       "\n",
       "  HasTermDeposit  \n",
       "0             no  \n",
       "1             no  \n",
       "2             no  \n",
       "3             no  \n",
       "4             no  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 45211 entries, 0 to 45210\n",
      "Data columns (total 17 columns):\n",
      " #   Column          Non-Null Count  Dtype \n",
      "---  ------          --------------  ----- \n",
      " 0   age             45211 non-null  int64 \n",
      " 1   job             44923 non-null  object\n",
      " 2   marital         45211 non-null  object\n",
      " 3   education       43354 non-null  object\n",
      " 4   default         45211 non-null  object\n",
      " 5   balance         45211 non-null  int64 \n",
      " 6   housing         45211 non-null  object\n",
      " 7   loan            45211 non-null  object\n",
      " 8   contact         32191 non-null  object\n",
      " 9   day_of_month    45211 non-null  int64 \n",
      " 10  month           45211 non-null  object\n",
      " 11  duration        45211 non-null  int64 \n",
      " 12  campaign        45211 non-null  int64 \n",
      " 13  pdays           45211 non-null  int64 \n",
      " 14  previous        45211 non-null  int64 \n",
      " 15  poutcome        8252 non-null   object\n",
      " 16  HasTermDeposit  45211 non-null  object\n",
      "dtypes: int64(7), object(10)\n",
      "memory usage: 5.9+ MB\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "None"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "## Carga de datos\n",
    "\n",
    "import pandas as pd \n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "\n",
    "data=pd.read_csv('bank_marketing.csv')\n",
    "\n",
    "# Mostrar las primeras filas\n",
    "display(data.head())\n",
    "\n",
    "# Información general del DataFrame\n",
    "display(data.info())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Limpieza de datos\n",
    "\n",
    "La limpieza de datos es un paso crucial en el proceso de análisis de datos. Los datos limpios son esenciales para obtener resultados precisos y confiables. En esta sección, aprenderemos a limpiar y manipular datos utilizando Python y la biblioteca Pandas.\n",
    "\n",
    "```{note}\n",
    "Pandas es una biblioteca de Python que proporciona estructuras de datos y herramientas de análisis de datos. Es ampliamente utilizada en la comunidad de ciencia de datos y es una de las bibliotecas más populares para el análisis de datos en Python.\n",
    "\n",
    "Para el proceso de limpieza y manipulación de datos, utilizaremos las siguientes funciones y métodos de Pandas:\n",
    "\n",
    "- `isnull()`: Comprueba si hay valores nulos en un DataFrame.\n",
    "\n",
    "- `dropna()`: Elimina filas o columnas con valores nulos de un DataFrame.\n",
    "\n",
    "- `fillna()`: Rellena los valores nulos con un valor específico.\n",
    "\n",
    "- `astype()`: Convierte el tipo de datos de una columna a un tipo de datos específico.\n",
    "\n",
    "- `describe()`: Muestra estadísticas descriptivas de un DataFrame.\n",
    "\n",
    "- `corr()`: Calcula la correlación entre columnas de un DataFrame.\n",
    "\n",
    "- `plot()`: Crea gráficos a partir de los datos de un DataFrame.\n",
    "\n",
    "- `value_counts()`: Cuenta los valores únicos en una columna de un DataFrame.\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age                   0\n",
       "job                 288\n",
       "marital               0\n",
       "education          1857\n",
       "default               0\n",
       "balance               0\n",
       "housing               0\n",
       "loan                  0\n",
       "contact           13020\n",
       "day_of_month          0\n",
       "month                 0\n",
       "duration              0\n",
       "campaign              0\n",
       "pdays                 0\n",
       "previous              0\n",
       "poutcome          36959\n",
       "HasTermDeposit        0\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Identificar valores faltantes\n",
    "display(data.isnull().sum())\n",
    "\n",
    "# Si hay valores faltantes, aplicar estrategias de imputación o eliminación"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "job\n",
       "blue-collar      9732\n",
       "management       9458\n",
       "technician       7597\n",
       "admin.           5171\n",
       "services         4154\n",
       "retired          2264\n",
       "self-employed    1579\n",
       "entrepreneur     1487\n",
       "unemployed       1303\n",
       "housemaid        1240\n",
       "student           938\n",
       "NaN               288\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Variable job\n",
    "\n",
    "data['job'].value_counts(dropna=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "job\n",
      "blue-collar      9732\n",
      "management       9458\n",
      "technician       7597\n",
      "admin.           5171\n",
      "services         4154\n",
      "retired          2264\n",
      "self-employed    1579\n",
      "entrepreneur     1487\n",
      "unemployed       1303\n",
      "housemaid        1240\n",
      "student           938\n",
      "NaN               288\n",
      "Name: count, dtype: int64\n",
      "job\n",
      "blue-collar      10020\n",
      "management        9458\n",
      "technician        7597\n",
      "admin.            5171\n",
      "services          4154\n",
      "retired           2264\n",
      "self-employed     1579\n",
      "entrepreneur      1487\n",
      "unemployed        1303\n",
      "housemaid         1240\n",
      "student            938\n",
      "Name: count, dtype: int64\n",
      "job\n",
      "blue-collar      9732\n",
      "management       9458\n",
      "technician       7597\n",
      "admin.           5171\n",
      "services         4154\n",
      "retired          2264\n",
      "self-employed    1579\n",
      "entrepreneur     1487\n",
      "unemployed       1303\n",
      "housemaid        1240\n",
      "student           938\n",
      "Desconocido       288\n",
      "Name: count, dtype: int64\n",
      "job\n",
      "blue-collar      9732\n",
      "management       9458\n",
      "technician       7597\n",
      "admin.           5171\n",
      "services         4154\n",
      "retired          2264\n",
      "self-employed    1579\n",
      "entrepreneur     1487\n",
      "unemployed       1303\n",
      "housemaid        1240\n",
      "student           938\n",
      "Name: count, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "## Estrategias de imputación\n",
    "\n",
    "# Imputación con la moda\n",
    "\n",
    "data_1=data.copy()\n",
    "data_1['job'].fillna(data_1['job'].mode()[0], inplace=True)\n",
    "\n",
    "# Imputación con la categoría 'Desconocido'\n",
    "\n",
    "data_2=data.copy()\n",
    "data_2['job'].fillna('Desconocido', inplace=True)\n",
    "\n",
    "\n",
    "# Eliminación de las filas con valores faltantes\n",
    "\n",
    "data_3=data.copy()\n",
    "data_3.dropna(subset=['job'], inplace=True)\n",
    "\n",
    "# Comparación de las estrategias de imputación\n",
    "\n",
    "print(data['job'].value_counts(dropna=False))\n",
    "\n",
    "print(data_1['job'].value_counts(dropna=False))\n",
    "\n",
    "print(data_2['job'].value_counts(dropna=False))\n",
    "\n",
    "print(data_3['job'].value_counts(dropna=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Tratamiento de variables nulas numéricas\n",
    "\n",
    "Ejemplo=pd.DataFrame({'A':[1,2,None,None,3,4,None,5,6,None,7,8,None,9,10,None],})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       A\n",
       "0    1.0\n",
       "1    2.0\n",
       "2    NaN\n",
       "3    NaN\n",
       "4    3.0\n",
       "5    4.0\n",
       "6    NaN\n",
       "7    5.0\n",
       "8    6.0\n",
       "9    NaN\n",
       "10   7.0\n",
       "11   8.0\n",
       "12   NaN\n",
       "13   9.0\n",
       "14  10.0\n",
       "15   NaN"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Ejemplo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       A\n",
       "0    1.0\n",
       "1    2.0\n",
       "2    5.5\n",
       "3    5.5\n",
       "4    3.0\n",
       "5    4.0\n",
       "6    5.5\n",
       "7    5.0\n",
       "8    6.0\n",
       "9    5.5\n",
       "10   7.0\n",
       "11   8.0\n",
       "12   5.5\n",
       "13   9.0\n",
       "14  10.0\n",
       "15   5.5"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Imputación con la media\n",
    "\n",
    "Ejemplo_1=Ejemplo.copy()\n",
    "Ejemplo_1['A'].fillna(Ejemplo_1['A'].mean(), inplace=True)\n",
    "Ejemplo_1\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>5.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       A\n",
       "0    1.0\n",
       "1    2.0\n",
       "2    5.5\n",
       "3    5.5\n",
       "4    3.0\n",
       "5    4.0\n",
       "6    5.5\n",
       "7    5.0\n",
       "8    6.0\n",
       "9    5.5\n",
       "10   7.0\n",
       "11   8.0\n",
       "12   5.5\n",
       "13   9.0\n",
       "14  10.0\n",
       "15   5.5"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Imputación con la mediana\n",
    "\n",
    "Ejemplo_2=Ejemplo.copy()\n",
    "Ejemplo_2['A'].fillna(Ejemplo_2['A'].median(), inplace=True)\n",
    "Ejemplo_2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       A\n",
       "0    1.0\n",
       "1    2.0\n",
       "2    1.0\n",
       "3    1.0\n",
       "4    3.0\n",
       "5    4.0\n",
       "6    1.0\n",
       "7    5.0\n",
       "8    6.0\n",
       "9    1.0\n",
       "10   7.0\n",
       "11   8.0\n",
       "12   1.0\n",
       "13   9.0\n",
       "14  10.0\n",
       "15   1.0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Imputación con la moda\n",
    "\n",
    "Ejemplo_3=Ejemplo.copy()\n",
    "Ejemplo_3['A'].fillna(Ejemplo_3['A'].mode()[0], inplace=True)\n",
    "Ejemplo_3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-999.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-999.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-999.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>-999.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>-999.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>-999.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        A\n",
       "0     1.0\n",
       "1     2.0\n",
       "2  -999.0\n",
       "3  -999.0\n",
       "4     3.0\n",
       "5     4.0\n",
       "6  -999.0\n",
       "7     5.0\n",
       "8     6.0\n",
       "9  -999.0\n",
       "10    7.0\n",
       "11    8.0\n",
       "12 -999.0\n",
       "13    9.0\n",
       "14   10.0\n",
       "15 -999.0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Imputación con un valor constante\n",
    "\n",
    "Ejemplo_4=Ejemplo.copy()\n",
    "Ejemplo_4['A'].fillna(-999, inplace=True)\n",
    "Ejemplo_4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       A\n",
       "0    1.0\n",
       "1    2.0\n",
       "4    3.0\n",
       "5    4.0\n",
       "7    5.0\n",
       "8    6.0\n",
       "10   7.0\n",
       "11   8.0\n",
       "13   9.0\n",
       "14  10.0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Eliminación de las filas con valores faltantes\n",
    "\n",
    "Ejemplo_5=Ejemplo.copy()\n",
    "Ejemplo_5.dropna(subset=['A'], inplace=True)\n",
    "Ejemplo_5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Corrección de tipos de datos\n",
    "\n",
    "El primer paso en el proceso de limpieza de datos es corregir los tipos de datos. A menudo, los datos se almacenan en formatos que no son adecuados para el análisis. Por ejemplo, una columna que debería ser numérica se almacena como una cadena. En esta sección, aprenderemos a corregir los tipos de datos de un DataFrame.\n",
    "\n",
    "```{note}\n",
    "Para corregir los tipos de datos de un DataFrame, utilizaremos el método `astype()` de Pandas. Este método nos permite convertir el tipo de datos de una columna a un tipo de datos específico. Por ejemplo, si queremos convertir una columna a un tipo de datos numérico, podemos usar el siguiente código:\n",
    "\n",
    "```python\n",
    "df['column_name'] = df['column_name'].astype('float')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "month\n",
       "may    13766\n",
       "jul     6895\n",
       "aug     6247\n",
       "jun     5341\n",
       "nov     3970\n",
       "apr     2932\n",
       "feb     2649\n",
       "jan     1403\n",
       "oct      738\n",
       "sep      579\n",
       "mar      477\n",
       "dec      214\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Variable day_of_month\n",
    "\n",
    "data['day_of_month'].value_counts(dropna=False)\n",
    "data['month'].value_counts(dropna=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         5\n",
       "1         5\n",
       "2         5\n",
       "3         5\n",
       "4         5\n",
       "         ..\n",
       "45206    17\n",
       "45207    17\n",
       "45208    17\n",
       "45209    17\n",
       "45210    17\n",
       "Name: day_of_month, Length: 45211, dtype: object"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### Hagamos una cadena de texto de la forma dd-mmm\n",
    "## debemos cambiar el tipo de dato de day_of_month como una cadena de texto (string)\n",
    "\n",
    "data['day_of_month']=data['day_of_month'].astype(str)\n",
    "data['day_of_month']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0         5-may-2023\n",
       "1         5-may-2023\n",
       "2         5-may-2023\n",
       "3         5-may-2023\n",
       "4         5-may-2023\n",
       "            ...     \n",
       "45206    17-nov-2023\n",
       "45207    17-nov-2023\n",
       "45208    17-nov-2023\n",
       "45209    17-nov-2023\n",
       "45210    17-nov-2023\n",
       "Name: day_month, Length: 45211, dtype: object"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## Creamos la variable day_month\n",
    "\n",
    "data['day_month']=data['day_of_month']+'-'+data['month']+'-2023'\n",
    "\n",
    "data['day_month']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       2023-05-05\n",
       "1       2023-05-05\n",
       "2       2023-05-05\n",
       "3       2023-05-05\n",
       "4       2023-05-05\n",
       "           ...    \n",
       "45206   2023-11-17\n",
       "45207   2023-11-17\n",
       "45208   2023-11-17\n",
       "45209   2023-11-17\n",
       "45210   2023-11-17\n",
       "Name: day_month, Length: 45211, dtype: datetime64[ns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## La convertimos a tipo de dato fecha\n",
    "\n",
    "data['day_month']=pd.to_datetime(data['day_month'], format='%d-%b-%Y')\n",
    "data['day_month']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        298\n",
       "1        298\n",
       "2        298\n",
       "3        298\n",
       "4        298\n",
       "        ... \n",
       "45206    102\n",
       "45207    102\n",
       "45208    102\n",
       "45209    102\n",
       "45210    102\n",
       "Name: days_to_today, Length: 45211, dtype: int64"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### Cambiar el tipo de dato trae sus ventajas\n",
    "from datetime import datetime\n",
    "\n",
    "today=datetime.now()\n",
    "\n",
    "data['days_to_today']=(today-data['day_month']).dt.days\n",
    "data['days_to_today']"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}